data structure
The Structural Complexity of Matrix-Vector Multiplication
We consider the problem of preprocessing an n n matrix M, and supporting queries that, for any vector v, returns the matrix-vector product Mv. This problem has been extensively studied in both theory and practice: on one side, practitioners have developed algorithms that are highly efficient in practice, whereas on the other side, theoreticians have proven that the problem cannot be solved faster than naive multiplication in the worst-case. This lower bound holds even in the average-case, implying that existing average-case analyses cannot explain this gap between theory and practice. Hence, we study the problem for structured matrices. We show that for n n Boolean matrices of VC-dimension d, the matrix-vector multiplication problem can be solved with eO(n2)preprocessing and eO(n2 1/d) query time.
Discovering Data Structures: Nearest Neighbor Search and Beyond
We explore if it is possible to learn data structures end-to-end with neural networks, with a focus on the problem of nearest-neighbor (NN) search. We introduce a framework for data structure discovery, which adapts to the underlying data distribution and provides fine-grained control over query and space complexity. Crucially, the data structure is learned from scratch, and does not require careful initialization or seeding with candidate data structures. In several settings, we are able to reverse-engineer the learned data structures and query algorithms. For 1D nearest neighbor search, the model discovers optimal distribution (in)dependent algorithms such as binary search and variants of interpolation search. In higher dimensions, the model learns solutions that resemble k-d trees in some regimes, while in others, elements of locality-sensitive hashing emerge. Additionally, the model learns useful representations of high-dimensional data such as images and exploits them to design effective data structures. Beyond NN search, we believe the framework could be a powerful tool for data structure discovery for other problems and adapt our framework to the problem of estimating frequencies over a data stream. To encourage future work in this direction, we conclude with a discussion on some of the opportunities and remaining challenges of learning data structures end-to-end.
The Interplay of Data Structure and Imbalance in the Learning Dynamics of Diffusion Models
Nicoletti, Flavio, Ma, Chenxiao, Ventura, Enrico, Saglietti, Luca, Mannelli, Stefano Sarao
Real-world datasets are inherently heterogeneous, yet how per-class structural differences and sampling imbalance shape the training dynamics of diffusion models-and potentially exacerbate disparities-remains poorly understood. While models typically transition from an initial phase of generalization to memorizing the training set, existing theory assumes homogeneous data, leaving open how class imbalance and heterogeneity reshape these dynamics. In this work, we develop a high-dimensional analytical framework to study class-dependent learning in score-based diffusion models. Analyzing a random-features model trained on Gaussian mixtures, we derive the feature-covariance spectrum to characterize per-class generalization and memorization times. We reveal the explicit hierarchy governing these dynamics: class variance is the primary determinant of learning order-consistently favoring higher-variance classes-while centroid geometry plays a secondary role. Sampling imbalance acts as a modulator that can reverse this ordering and, under strong imbalance, forces minority classes to acquire distinct, delayed speciation times during backward diffusion. Together, these results suggest that diffusion models can memorize some classes while others remain insufficiently learned. We validate our theoretical predictions empirically using U-Net models trained on Fashion MNIST.
Worst-case Performance of Popular Approximate Nearest Neighbor Search Implementations: Guarantees and Limitations
Graph-based approaches to nearest neighbor search are popular and powerful tools for handling large datasets in practice, but they have limited theoretical guarantees. We study the worst-case performance of recent graph-based approximate nearest neighbor search algorithms, such as HNSW, NSG and DiskANN. For DiskANN, we show that its "slow preprocessing" version provably supports approximate nearest neighbor search query with constant approximation ratio and poly-logarithmic query time, on data sets with bounded "intrinsic" dimension. For the other data structure variants studied, including DiskANN with "fast preprocessing", HNSW and NSG, we present a family of instances on which the empirical query time required to achieve a "reasonable" accuracy is linear in instance size. For example, for DiskANN, we show that the query procedure can take at least 0.1n steps on instances of size nbefore it encounters any of the 5nearest neighbors of the query.
Faster Query Times for Fully Dynamic k-Center Clustering with Outliers
Given a point set P M from a metric space (M,d)and numbers k,z N, the metric k-center problem with z outliers is to find a set C P of k points such that the maximum distance of all but at most z outlier points of P to their nearest center in C is minimized. We consider this problem in the fully dynamic model, i.e., under insertions and deletions of points, for the case that the metric space has a bounded doubling dimension dim. We utilize a hierarchical data structure to maintain the points and their neighborhoods, which enables us to efficiently find the clusters. In particular, our data structure can be queried at any time to generate a (3 + ε)-approximate solution for input values of k and z in worst-case query time ε O(dim)klognloglog, where is the ratio between the maximum and minimum distance between two points in P. Moreover, it allows insertion/deletion of a point in worst-case update time ε O(dim) lognlog . Our result achieves a significantly faster query time with respect to k and z than the current state-of-theart by Pellizzoni, Pietracaprina, and Pucci [18], which uses ε O(dim)(k+z)2 log query time to obtain a (3+ε)-approximate solution.